This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information.
The diamention of Proserper Loan Dataset
dim(pr)
## [1] 113937 81
In total there are 81 variables that corresponding to each loan In order better understand the dataset, in this analysis, 19 variables will be selected.
pr <- subset(pr, select = c('LoanStatus',
'BorrowerAPR',
'BorrowerRate',
'LenderYield',
'ProsperScore',
'BorrowerState',
'Occupation',
'EmploymentStatus',
'IsBorrowerHomeowner',
'TotalCreditLinespast7years',
'TotalInquiries',
'BankcardUtilization',
'AvailableBankcardCredit',
'IncomeRange',
'IncomeVerifiable',
'LoanOriginalAmount',
'LoanOriginationDate',
'MonthlyLoanPayment',
'Investors'
))
str(pr)
## 'data.frame': 113937 obs. of 19 variables:
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ TotalCreditLinespast7years: int 12 29 3 29 49 49 20 10 32 32 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
Prosper Score measures the loan applicant’s risk level, the higer the score the lower the risk to lend.
summary(pr$ProsperScore)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 4.00 6.00 5.95 8.00 11.00 29084
table(pr$ProsperScore)
##
## 1 2 3 4 5 6 7 8 9 10 11
## 992 5766 7642 12595 9813 12278 10597 12053 6911 4750 1456
The Highest score is 11 and the score distribution would be better presented by the following chart.
From the histogram, the majority of the ProserScore are 4, 6, and 8
Most of the loan are current and the second is the completed loans
summary(pr$LoanStatus)
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted FinalPaymentInProgress
## 56576 5018 205
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 16 806 265
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 363 313 304
Creating a new variabel income_range to better format the IncomeRange variable
The loan data mostly coming from people who works employed or full time.
summary(pr$EmploymentStatus)
## Employed Full-time Not available Not employed
## 2255 67322 26355 5347 835
## Other Part-time Retired Self-employed
## 3806 1088 795 6134
People whose income is between $25,000 and $100,000 applied for the loans.
summary(pr$income_range)
## $0 $1-24,999 $25,000-49,999 $50,000-74,999 $75,000-99,999
## 621 7274 32192 31050 16916
## $100,000+ Not employed Not displayed
## 17337 806 7741
Based on the available bank credit breakdown, most of the borrowers have the credit less than 6,000.
summary(pr$available_bank_credit_bracket)
## (0,2] (2,4] (4,6] (6,8] (8,10] (10,12] (12,14] NA's
## 33758 14002 9200 6676 5176 4055 3295 37775
Summary of available_bank_credit_bracket
Borrowers’ interest rates are between 0 tp 35%.
summary(pr$AvailableBankcardCredit)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 880 4100 11210 13180 646300 7544
Summary of AvailableBankcardCredit
summary(pr$available_bank_credit_bracket)
## (0,2] (2,4] (4,6] (6,8] (8,10] (10,12] (12,14] NA's
## 33758 14002 9200 6676 5176 4055 3295 37775
Summary of available_bank_credit_bracket
The ProsperLoan data have 113937 overvations and 19 variables ProsperSocre: A custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score.
Applicable for loans originated after July 2009. Loanstatus: Completed, Current, Past Due (1-15 days), Defaulted, Chargedoff, Past Due (16-30 days), Cancelled, Past Due (61-90 days), Past Due (31-60 days), Past Due (91-120 days)
EmploymentStatus:Self-employed, Employed, Not available, Full-time, Other, Not employed, Part-time, Retired
IncomeRange: $0, $1-24,999, $25,000-49,999, $50,000-74,999, $75,000-99,999, $100,000+, Not employed, Not displayed
The main feature is the ProsperSocre, which measue the risk ability of the loan itself, verus the BorrowerRate
The loan applicants’ occupation,income, bankcard utilization, available bank card credit and other variables might impact the rick score when valued by th Prosper Company
Yes, I created the new variable range_new to reorder the income range variable in a ascending order.
adjust, or change the form
of the data? If so, why did you do this?
Yes. I did select 19 variables out of 81 in total. The reason for this is that not all the variables are revelent in determining the Prosper Score.
For applicants with lower income, the Prosper Score seems lower than 6, and for people with higher income, the ProsperScore seems higher, which means less risk.
tapply( pr$ProsperScore, pr$income_range,summary)
## $`$0`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.0 4.0 5.0 4.6 6.0 9.0 576
##
## $`$1-24,999`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 4.000 5.000 5.093 7.000 11.000 2620
##
## $`$25,000-49,999`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 4.000 5.000 5.424 7.000 11.000 8017
##
## $`$50,000-74,999`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 4.000 6.000 5.957 8.000 11.000 5423
##
## $`$75,000-99,999`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 4.000 6.000 6.297 8.000 11.000 2418
##
## $`$100,000+`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 5.000 7.000 6.738 9.000 11.000 2132
##
## $`Not employed`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 4.000 5.000 5.308 7.000 10.000 157
##
## $`Not displayed`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 7741
From the box plot we can see it more clearly, the median of people of income greater than $50,000 is much higher than people with income less than $50,000
If we compare ProsperScore with employment status, clearly, full time employees will have greater ProperScore, and therefore, less risky to lend money to them
tapply( pr$ProsperScore, pr$EmploymentStatus,summary)
## [[1]]
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 2255
##
## $Employed
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 4.000 6.000 5.973 8.000 11.000 12
##
## $`Full-time`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 6.000 8.000 7.006 9.000 11.000 18428
##
## $`Not available`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 5347
##
## $`Not employed`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 4.000 5.000 5.308 7.000 10.000 186
##
## $Other
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 4.000 5.000 5.167 7.000 11.000
##
## $`Part-time`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 5.000 7.000 6.801 9.000 10.000 832
##
## $Retired
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 5.000 7.000 6.237 8.000 10.000 428
##
## $`Self-employed`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 3.000 4.000 4.444 6.000 11.000 1596
It is clearer to see from the heat map that employed and full time workers who have above average income will have higher Prosper Score. Meanwhile, for some self employed applicants, even though the income range is above $100,000, the Prosper Score is extremely low, indicting higher risk than other applicants with considerably lower income.
Unsperisingly, applicants with higher prosperscore seem to have lower borrow rate.
Summary for Prosper Score and Borrower Rate(%)
tapply(pr$BorrowerRate*100, pr$ProsperScore, summary)
## $`1`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.99 29.99 31.23 30.21 31.77 35.00
##
## $`2`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.50 24.92 27.86 27.12 30.32 36.00
##
## $`3`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.09 21.00 24.88 24.79 29.25 35.00
##
## $`4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.16 17.90 21.24 22.54 26.99 36.00
##
## $`5`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.16 17.15 21.99 22.91 30.58 36.00
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.99 15.35 19.40 20.62 25.99 35.00
##
## $`7`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.59 13.85 17.60 18.51 24.68 35.00
##
## $`8`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.86 11.39 14.49 15.17 17.74 36.00
##
## $`9`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.98 9.46 11.39 12.51 14.35 35.00
##
## $`10`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.000 7.160 8.790 9.797 11.590 35.000
##
## $`11`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.050 6.590 8.690 9.328 10.990 19.500
##
## Pearson's product-moment correlation
##
## data: pr$ProsperScore and pr$BorrowerRate
## t = -248.98, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6536072 -0.6458311
## sample estimates:
## cor
## -0.6497361
And also from the correlation, Prosper Score and Borrower Rate has a strong negative correlation.
Higer available bank card credit will also have lower borrow rate.
with(na.omit(pr), cor(BorrowerRate*100,AvailableBankcardCredit/1000))
## [1] -0.274097
Correlation between BorrowerRate and AvailableBankcardCredit
Other than main feature, I also noticed applicants with higher available bank card credit will also have lower borrow rate.
Borrow rate verus the properscore, The coefficient between them is -0.6682872.
The same pattern also occur for different employment status
When comparing with the main relationship ProsperScore and BorrowerRate, we also noticed a liner relationship adding the category variable income range
##
## Calls:
## m1: lm(formula = I(BorrowerRate) ~ 0 + I(ProsperScore), data = subset(pr,
## !is.na(ProsperScore)))
## m2: lm(formula = I(BorrowerRate) ~ I(ProsperScore) + AvailableBankcardCredit -
## 1, data = subset(pr, !is.na(ProsperScore)))
## m3: lm(formula = I(BorrowerRate) ~ I(ProsperScore) + AvailableBankcardCredit +
## IncomeRange - 1, data = subset(pr, !is.na(ProsperScore)))
## m4: lm(formula = I(BorrowerRate) ~ I(ProsperScore) + AvailableBankcardCredit +
## IncomeRange + EmploymentStatus - 1, data = subset(pr, !is.na(ProsperScore)))
## m5: lm(formula = I(BorrowerRate) ~ I(ProsperScore) + AvailableBankcardCredit +
## IncomeRange + EmploymentStatus + TotalCreditLinespast7years -
## 1, data = subset(pr, !is.na(ProsperScore)))
##
## ============================================================================================================================
## m1 m2 m3 m4 m5
## ----------------------------------------------------------------------------------------------------------------------------
## I(ProsperScore) 0.026*** 0.028*** -0.018*** -0.019*** -0.019***
## (0.000) (0.000) (0.000) (0.000) (0.000)
## AvailableBankcardCredit -0.000*** -0.000*** -0.000*** -0.000***
## (0.000) (0.000) (0.000) (0.000)
## IncomeRange: $0 0.352*** 0.348*** 0.348***
## (0.008) (0.008) (0.008)
## IncomeRange: $1-24,999 0.334*** 0.336*** 0.336***
## (0.001) (0.001) (0.001)
## IncomeRange: $100,000+ 0.304*** 0.307*** 0.307***
## (0.001) (0.001) (0.001)
## IncomeRange: $25,000-49,999 0.317*** 0.319*** 0.318***
## (0.001) (0.001) (0.001)
## IncomeRange: $50,000-74,999 0.308*** 0.310*** 0.309***
## (0.001) (0.001) (0.001)
## IncomeRange: $75,000-99,999 0.306*** 0.309*** 0.308***
## (0.001) (0.001) (0.001)
## IncomeRange: Not employed 0.364*** 0.368*** 0.367***
## (0.002) (0.002) (0.002)
## EmploymentStatus: Full-time/Employed 0.025*** 0.025***
## (0.001) (0.001)
## EmploymentStatus: Other/Employed -0.003** -0.003**
## (0.001) (0.001)
## EmploymentStatus: Part-time/Employed 0.016*** 0.016***
## (0.003) (0.003)
## EmploymentStatus: Retired/Employed 0.020*** 0.020***
## (0.003) (0.003)
## EmploymentStatus: Self-employed/Employed -0.009*** -0.009***
## (0.001) (0.001)
## TotalCreditLinespast7years 0.000
## (0.000)
## ----------------------------------------------------------------------------------------------------------------------------
## R-squared 0.612 0.621 0.932 0.933 0.933
## adj. R-squared 0.612 0.621 0.932 0.933 0.933
## sigma 0.131 0.129 0.055 0.054 0.054
## F 133697.431 69606.628 128923.549 84690.770 79044.868
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood 52266.161 53323.957 126092.434 126950.145 126950.686
## Deviance 1449.358 1413.669 254.367 249.276 249.273
## AIC -104528.323 -106641.915 -252164.869 -253870.290 -253869.372
## BIC -104509.626 -106613.869 -252071.382 -253730.059 -253719.793
## N 84853 84853 84853 84853 84853
## ============================================================================================================================
From multivariate analysis, the borrower’s rate is also affacted by other variables such as income range, employment status, aviable bank credits.
Holding the borrower’s rate constant, the employed status will have lower prosper score.
Yes, I did. For the linear models I created, R square is about 93%, which means almost 93% percent of the variation can be explained by the model. I also excluded the intercept, which strenthed the linear relationship between the variables.
The loans were valued by the risk levels which bing called as ProsperScore, and the greater the score the lower the risk. Histogram showed us the counts for different ProsperScore. Majority of the loans have the score between 4 and 9.
Ggplot gives us the negative correlation between ProsperScore and the borrower’s rate, the higher the score seems to lead to a lower borrower’s rate.
When considering other variables, for example, income range, will also help us in predicting the borrower’s rate.From the plot, the higer income leads also have higer ProsperScore, and revelvantly lower interest rate.
Prosper Loan dataset has thorogh loan data regrading theri unique attributes, and when evaluating the loan applications, these variables could benefit the company in deciding the accurate rate. From the the analysis, it is shown that borrower’s rate the highly correlated with borrower’s ProsperScore, which measured the risk of the applicant. We also learned that other factors such as income level, employment status, avaible bank credits could also affect borrower’s rate. More thorough demographic data can be included in the dataset, therefore, we can better the detailed attributs of the applicants.